A new iterative initialization of EM algorithm for Gaussian mixture models

Background The expectation maximization (EM) algorithm is a common tool for estimating the parameters of Gaussian mixture models (GMM). However, it is highly sensitive to initial value and easily gets trapped in a local optimum. Method To address these problems, a new iterative method of EM initialization (MRIPEM) is proposed in this paper. It incorporates the ideas of multiple restarts, iterations and clustering. In particular, the mean vector and covariance matrix of sample are calculated as the initial values of the iteration. Then, the optimal feature vector is selected from the candidate feature vectors by the maximum Mahalanobis distance as a new partition vector for clustering. The parameter values are renewed continuously according to the clustering results. Results To verify the applicability of the MRIPEM, we compared it with other two popular initialization methods on simulated and real datasets, respectively. The comparison results of the three stochastic algorithms indicate that MRIPEM algorithm is comparable in relatively high dimensions and high overlaps and significantly better in low dimensions and low overlaps.


Introduction
Gaussian mixture model (GMM) is a very useful tool, which is widely used in complex probability distribution modeling, such as data classification [1], image classification and segmentation [2][3][4], speech recognition [5], etc. The Gaussian mixture model is composed of K single Gaussian distributions. For a single Gaussian distribution, the parameters are usually estimated by using the maximum likelihood estimation (MLE) method, but this is not applicable to GMM. In fact, it is not known in advance which component each observation belongs to in GMM due to introduce the hidden variables. The expectation-maximization method (EM), introduced by Dempster et al. [6], can complete the parameter estimation of the GMM by iteratively constructing the lower limit of the likelihood function to continuously improve the value of the likelihood function. However, this procedure is highly sensitive to initialization and easily gets trapped in a local optimum [7,8]. As shown in Fig 1, when the likelihood function is nonconvex, the EM algorithm may terminate the iteration before reaching the global optimum. Thus, initialization research is very necessary for EM algorithm. At present, many initialization methods of the EM algorithm for GMM have been proposed, which are mainly divided into two situations with a known and an unknown numbers of components K. When the number of components K is known, some initialization methods belong to the deterministic strategy. For example, the initialization method in the document [9] is based on hierarchical agglomerative clustering (HAC), which uses Ward's criterion measurement to obtain the means of the initial model. The procedure proposed by Maitra relies on detecting the best local modes [10]. The deterministic methods may lead to an incorrect solution or even no solution when the likelihood function is unbounded [11]. In comparison, stochastic initialization strategies can bring improvements with the increase of runs. The standard program of stochastic initialization is the multiple restart method (MREM) [12]. In this approach, the EM algorithm is run many times with different random initial conditions. The parameters corresponding to the highest log likelihood are returned as the final parameter estimation. When the sample size is large, the method is still easy to fall into local optimization. The emEM [13] and RndEM [10] algorithms are the other two typical representatives of stochastic initialization strategies. The emEM algorithm starts with the phase called short EM at which the EM algorithm iterates multiple times from different random points according to a lax convergence criterion. The set of parameters that produced the highest likelihood value is then used for the final long EM run. RndEM is a variant of emEM algorithm. It involves the same stages as emEM, but the short EM phase stops after the very first parameter estimation. In the paper of Blömer and Bujna [14], two new initialization methods are presented based on the well-known K-means++ algorithm and the Gonzalez algorithm. In the method of rndmaxmin [15], the initialization mean is selected from the random subset of candidate eigenvectors by application of Mahalanobis distance. The covariance matrix of the component is initialized by randomly generating the eigenvalues and eigenvectors.
The above approaches generally need to give the number of components K in advance. In fact, if K is unknown, it can be assumed firstly to belong to a set of values K. For each different K 2 K, the initialization method and EM algorithm are run. The K value of best fitting model is to be chosen based on some test procedures or criteria (usually log-likelihood value). Moreover, many initialization methods integrate the estimation of K in the process of calculating parameters. A clustering algorithm using fast search of density peak points (DPC) [16] can predict the number of components K and centers of the initial class. The method of S-EM [17] initializes the mean vectors by choosing points that have a higher concentration of neighbors and the covariance matrix by using a truncated normal distribution. In the paper of Verbeek et al. [18], a greedy algorithm is presented which does not need an initialization but needs to construct a whole sequence of mixture models with m = 1, . . ., K components instead. Subsequently, the optimized greedy initialization methods are proposed [19,20]. Another method is to apply Rough-Enhanced-Bayes mixture estimation (REBMIX) to the initialization of the EM algorithm [21]. In fact, this approach combines a new type of clustering algorithm with the EM algorithm to improve the accuracy of the results, and the approach itself is not innovative.
Clearly, there is no "the best initialization algorithm" that can be applied to all instances. The performance of initialization depends on two aspects: one is the data itself, including the degree of overlap, sample size, and dimension. The other is the allowable computation cost. Besides, the choice of hyperparameters for some algorithms also has a crucial impact on the final results.
In order to optimize the initial value of the EM algorithm and the estimated results, we propose a new iterative initialization method of EM algorithm for GMM under the situation with a known K in this paper. This method is to calculate the mean vector and covariance matrix of sample as the initial value of the iteration rather than to start with many different random initial conditions. Then, the optimal feature vector is selected from the candidate feature vectors by the maximum Mahalanobis distance as a new partition vector for clustering. The parameter values are renewed continuously according to the clustering results. To verify its applicability, we compared the proposed method with other two popular initialization methods on simulated and real datasets, respectively.

EM algorithm for Gaussian mixture models
For d-dimensional random variable X with n samples, the probability distribution of a finite Gaussian mixture model can be expressed by a weighted sum of K components [22]: where α m is m-th mixing proportion, which must satisfy α m > 0, m = 1, . . ., K and P K m¼1 a m ¼ 1.
where d is the dimension of the feature space and |�| denotes the determinant of a matrix. Thus, Θ = {α 1 , . . ., α m , μ 1 , S 1 , . . ., μ m , S m } is an unknown set of the parameters and it has to be estimated in the mixture learning process. The number of components K is either known or must be determined in the learning process. In this paper it is assumed that K is known. Suppose X = {x 1 , x 2 , . . ., x n } is independent and identically distributed. For K = 1, Θ can be solved by MLE: Where L(Θ) refers to the likelihood function and is given by: For K > 1, the solution of this maximization problem cannot be obtained in a closed form. As a numerical optimization method, EM algorithm is often used for such cases. EM algorithm is an iterative optimization strategy. Each iteration is divided into two steps called the expectation step (E-step) and the maximization step (M-step). For GMM, it can be divided into the following steps after giving the initial parameters Θ (0) [23]: • E-step: The posterior probability of the s-th observation belonging to the K-th component, The E-steps and M-steps are iterated until a convergence criterion is met. In this paper, A convergence criterion of relative improvement based on log likelihood is adopted: where ε � 1 was a pre-specified tolerance level (in this paper, ε = 1e − 5).

Initialization procedure
In this section, the proposed methodology (denoted as MRIPEM) is introduced under the premise that the number of components K is known. The initialization method of this paper involves the idea of clustering. Starting from the number of clusters is 1, the mean vectors, covariance matrices, and mixing proportions are gradually renewed in each iteration. The initialization procedure can be described in detail by the following steps: 1. Compute the population mean vector μ 1 and covariance matrix S 1 by MLE and set m = 2.
2. Choose t (t is a parameter of the method) feature vectors randomly from X to form X l = {x l1 , x l2 , . . ., x lt }. For each x li �X l , compute the minimal squared Mahalanobis distance [24] to the mean vectors μ 1 , μ 2 , . . ., μ m−1 that have already been chosen: where d 2 m ðm j ; S j ; x li Þ is given by: 3. Select p m ¼ arg max 4. Divide X into m partitions {C 1 , C 2 , . . ., C m } by using Euclidean distance, i.e. assign sample points x i to the nearest partition vector.
5. According to the partition {C 1 , C 2 , . . ., C m } obtained in step 4, update the mean vectors, covariance matrices and mixing proportions of each component: where |?| represents the number of sample points in the ?.
6. If S j is not positive definite, use a spherical covariance matrix instead: where I D denotes the d-dimensional identity matrix.
7. m = m + 1 if m < K then go to step 2. Otherwise, terminate the algorithm.
8. Set the manual loop parameter r that represents the number of runs of the initialization method and take the results corresponding to the best Adjusted Rand Index (ARI) [25] value as the output of the final initialization.

Properties of the initialization procedure
The iterative initialization method of this paper is a stochastic strategy, and it depends on samples completely. If K = 1, the initial parameter value is obtained in step 1 by using MLE. The algorithm iterates from step 2, gradually increasing the number of partitions until X is divided into K components.
The main idea of step 2 and step 3 is to use the minimal squared Mahalanobis distance to sift out the next optimal partition vector that is used for dividing the sample. When dividing the samples, our purpose is to make the selected partition vectors as far as possible from each other. This can reduce the probability of different components of samples being divided into the same component. In step 2, the t sample points are selected randomly from X as the candidate vectors. Then, the minimal squared Mahalanobis distance is calculated from the t candidate vectors to the determined m − 1 centers. The m-th partition vector is determined according to the maximum distance calculated above (Step 3). It should be noted that Mahalanobis distance is selected as the standard for determining the partition vector because it is an effective multivariate distance metric taking into account the relationship between various feature vectors [26]. In addition, the sensitivity study on the selection of t for different numbers of K(K = 2, 3, 4, 5, 7, 9) was made according to ARI value. In fact, ARI is an evaluation index proposed by Hubert and Arabie [25] to measure the clustering performance. The larger the ARI value, the more consistent the clustering results of the two partitions. When ARI takes a negative value, it indicates that the labels of the two partitions are independently distributed. When ARI is equal to 1, it indicates that the two partitions are identical.
As can be seen from Fig 2, when K ⩽ 5 and t ⩽ K, the ARI value rise rapidly until t reaches K. When t > K, the ARI value gradually tends to a plateau. Therefore, we define t = K for K < 5. Similarly, when K > 5, the ARI value gradually grows with the increase of t until t = 5. The ARI value fluctuates little after t > 5. Therefore, we define t = 5 for K � 5. The simulation experiments indicate that we also can get excellent results by bringing t = 5 (if K � 5) sample points into step2 while the sample size is much larger than t. Choosing t instead of all n sample points means that the amount of iteration calculation will be greatly reduced, thus, the running speed of the algorithm is accelerated.
The sample points are clustered (Step 4) through the new partition vector (Step 3). The clustering results are used to estimate the m-th parameter values, which are mean vectors, covariance matrices, and mixing proportions.
Step 5 lists the calculation formula of each parameter. In step 6, the covariance matrix is replaced by the spherical covariance matrix when it is not positive definite. This usually happens when the cluster is too small at the early stage of the iteration. The spherical covariance matrix is not only a positive definite matrix, but also can be constructed quickly without being affected by the iterative process. The parameter values are continuously updated during the process of iteration. The iteration stops and the last results are the outputs of the initialization method (Step 7) When m = K.
Due to the random nature of the algorithm, the results will vary from run to run. In order to get the best initialization results, we have added a loop parameter r inside the initialization procedure representing the number of times to run our initialization method (Step 8). In r runs, ARI is a performance criterion to evaluate the results of each initialization. The initialization results corresponding to the maximum value of ARI is selected as the output of our initialization algorithm.
Step 8 can be adjusted manually. Generally speaking, the greater the number of runs r, the better the initialization results. However, in order to improve the operation efficiency, we attempted to conduct a sensitivity analysis between the number of runs and the initialization results. The r(r = 1, 5, 10, 15, 20) experiments were conducted under three kinds of dimensions (p = 2, 5, 10), respectively. Considering the randomness of the algorithm, each group of experiments (r, p) was repeated 30 times and the ARI value was determined using the ARI mean of 30 repeated experiments. As show in Fig 3, the best performances are obtained under three kinds of dimensions when r = 10. Therefore, the manual loop parameter r is fixed to 10 in the subsequent experiments of this paper.

Results
In this section, the proposed initialization method is combined with the EM algorithm to estimate the parameters of the GMM through the clustering of the data, which is tested on the artificially generated datasets and real datasets, respectively. Meanwhile, our approach is compared with two other popular stochastic initialization strategies, which are emEM and RndEM.
The ARI value and the log-likelihood (logL(Θ)) were used to measure the performance of the data clustering. logL(Θ) is the objective function of the EM algorithm, and the larger it is,

PLOS ONE
the better is. The combination of the two performance criteria can accurately measure the effectiveness of algorithms and make the experimental results more credible.

Simulation study
The artificial datasets for simulation study are generated by using the MixSim package in R [27], which can generate a finite mixture model with Gaussian components for prespecified levels of maximum pairwise overlap � o, average pairwise overlap � o, the number of clusters K and the dimension of samples p.
Naturally, the dimensionality is considered to have an impact on the clustering performance of methods. The higher the dimension of the samples, the worse the clustering effect may be. Additionally, in

PLOS ONE
When three algorithms of emEM, RndEM and MRIPEM are compared, due to nature of stochastic strategies, we executed 30 different datasets for each triplet ðp; � o; � oÞ. In addition, to explore the stability of the results of 30 runs, we define var * : where var represents the variance of ARI for 30 results. The reason for 1000 times magnification operation is that the range of ARI value is [0, 1], and the calculated variance value is too small to distinguish. Besides, there is a large difference in the value of the log likelihood function calculated from 30 different data, so it is meaningless to calculate its variance. Here we only calculate the variant of ARI variance (var * ).

Results of simulation study
The comparison results of p = 2, p = 5 and p = 10 Gaussian mixture model datasets are listed in Tables 1-3 under three stochastic initialization methods, respectively. � I and � L represent the averages of ARI and logL(Θ) of 30 different datasets, respectively. I q and L q represent their corresponding third quartile. The values of ARI are accurate to 4 decimal places and the results of others are rounded up to 2 decimal places. The optimal value of each index has been highlighted in bold in the table.
Obviously, the three initialization methods perform differently with varying degrees of overlap and dimensionality. As show in Table 1, the values of five indexes indicate that MRI-PEM is significantly better than emEM and RndEM for six different combinations of overlap. When � o ¼ 0:0001 and � o ¼ 0:001, the � I value of emEM and RndEM are both below 0.9 and differ greatly from MRIPEM. When � o ¼ 0:01, although the difference of � I values among the three algorithms is reduced, the method of MRIPEM is still the best. Particularly, when w = 0.0001, the I q value of MRIPEM is close to 1. This displays that more than 75% of the results in 30 repeated experiments trials are consistent with the real classification results. Besides, the var * value of MRIPEM is the smallest compared with the other two algorithms, indicating that the obtained results of MRIPEM are relatively stable. The results of Table 2 illustrate that the situation of p = 5 is similar to that p = 2. In terms of the averages of ARI and logL(Θ), MRIPEM is also better than the other two methods for six overlap combinations. For the corresponding third quartile, when � o ¼ 0:01 and � o ¼ 0:3, the result of MRIPEM is slightly smaller than that of emEM. However, the difference of I q is only 0.26% and the difference of L q is only 1.27%. In Table 3, when the dimension is increased to 10, the results of MRI-PEM are still best for � o ¼ f0:0001; 0:001g. It's a pity that MRIPEM provides the worst results compared to the other two methods when � o ¼ 0:01. Nevertheless, the MRIPEM method still remains relatively stable to a certain extent from the results of var * .

PLOS ONE
In a word, the comparisons of ARI reveal that the MRIPEM method has a higher accuracy and can reduce the probability of data being wrongly divided. The comparisons of log-likelihood function values show that the parameters calculated by the MRIPEM method have high fitting validity. According to the values of var * , the MRIPEM method is relatively stable and has little volatility compared with the other two initialization methods. In addition, it can be found that the three stochastic initialization methods have different degrees of decline with the increase of the degree of overlap and dimension. This reflects from another aspect that the degree of data overlap and dimension are the important reasons that affect the performance of the algorithm.
In order to better present the specific information for results of the 30 datasets, we extract four representative groups from the 18 triplets, and draw the boxplots of ARI and logL(Θ) in On the premise of fixing the total number of samples n and the number of clusters K, the results of the simulation experiments show that MRIPEM is significantly better than the other two classic algorithms when the degrees of overlap( � o ¼ f0:0001; 0:001g) are not high. The MRIPEM method achieves the highest results on both ARI and logL(Θ). The Adjusted Rand Index is close to 1, almost perfect classification. In fact, the result of MRIPEM even appeared several times with ARI = 1 during the process of simulation. For the fp; � og ¼ f10; 0:01g, the MRIPEM method is inferior to the other two methods. Maybe the high degree of overlap and dimensionality increase the probability of selecting partition vectors from the same component by using maximum Mahalanobis distance. Therefore, it seems that our initialization algorithm is suitable for well-separated and low-dimensional samples. In fact, the performance of all three algorithms will decrease in varying degrees with the increase of overlap and dimension. It's obvious that more overlapping data increases the difficulty of distinguishing them.

Real datasets study
In this section, we use datasets of four known class labels from UCI machine learning database [28] and KEEL-dataset repository [29] to demonstrate the validity of the proposed method, namely Seeds, Aff, Appendicitis, and SKM. These datasets vary from dimension of feature space, sample size, number of classes, and degree of overlap.

Seeds dataset
The Seeds dataset is a commonly used classification experimental dataset. The dataset contains 210 instances, which belong to three different varieties of wheat: Kama, Rosa, and

PLOS ONE
Canadian. There are 70 instances in each category, and each instance contains 7 attributes: area A, perimeter P, compactness C, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. According to the seven attributes, which variety each seed belongs to can be predicted.

Aff dataset
The full name of Aff dataset is Algerian forest fires. It includes 244 instances that regroup data of two regions of Algeria, namely the Bejaia region located in the northeast of Algeria and the Sidi Bel-abbes region located in the northwest of Algeria, with 122 instances for each region. Through 7 attributes, 244 examples were divided into "fire" (138 classes) and "not fire" (106 classes).

Appendicitis dataset
The data represents 7 medical measures taken over 106 patients on which the class label represents if the patient has appendicitis (class label 1) or not (class label 0).

SKM dataset
It is the real dataset about the students' knowledge status about the subject of Electrical DC Machines. In accordance to five attributes including the degree of study time for the goal object materials (STG), it can be divided the knowledge level of students.
The above datasets all have known their respective dimension sizes, but the degrees of overlap are unknown. To get the level of clustering complexity for an existing classification dataset, the average pairwise overlap and maximum pairwise overlap of four real datasets were calculated by using the overlap function of MixSim package in R [27]. The information of four real datasets is listed in Table 4 as follows:

Results of real datasets study
As in the simulation experiments, considering that the three initialization methods are all stochastic strategies, this experiment was repeated 30 times and each time was started with a different seed of the random number generator. The results of three initialization methods under four real datasets are displayed in Table 5. Fig 9 intuitively shows the difference of ARI and logL(Θ) values of the three methods under the four real datasets. Due to the large difference between the logL(Θ) values of different datasets, the logL(Θ) results is converted into a proportional form (For each real dataset, logL(Θ) of each method / the sum of logL(Θ) of three methods).
It can be seen from Table 5 and Fig 9 that the ARI value of the MRIPEM is the largest among the three methods in four kinds of datasets. In terms of logL(Θ), MRIPEM is the largest except on the Seeds dataset. These results demonstrate that the classification results of MRI-PEM algorithm are closer to those of real datasets. In more detail, the MRIPEM method is

Discussion and conclusion
In this paper, a novel iterative initialization method of the EM algorithm for the estimation of the parameters of GMM was proposed. This new method, named MRIPEM, incorporates the ideas of multiple restarts, iterations, and clustering. The performance of MRIPEM was assessed by ARI and logL(Θ) indexes, and compared with the other two popular initialization strategies on simulated datasets and real datasets, respectively. In fact, the initial mean and covariance matrix of the iteration of the MRIPEM algorithm are calculated by taking all the sample data as one component, which is different from other algorithms to randomly generate the initial mean and variance. According to the Mahalanobis distance, the candidate vector farthest from all the determined center vector points is taken as the partition vector. In this way, the greater the distance, the greater the probability of being selected as the partition vector, otherwise the probability is smaller. That is, the probability of proper initialization of all component centers increases. After clustering the sample data according to the partition vector, the mean and variance of each component are continuously updated in each iteration. Obviously, the MRIPEM algorithm completely depends on the sample dataset itself, which reduces the randomness to a certain extent. In addition, the setting of the manually adjustable loop parameter r can also reduce the randomness to a certain extent and improve the accuracy of the results. The comparison results of simulation datasets and real datasets under the three algorithms confirm that the MRIPEM algorithm shows excellent accuracy and robustness in low dimensions and low overlaps. Therefore, the MRIPEM is recommended to apply to the well-separated datasets of the low dimension or after dimensionality reduction.

PLOS ONE
However, it is a well-known fact that no single method can outperform the others in all cases, and our algorithm is no exception. Similar to other initialization methods, the performance of the MRIPEM method also decreases when the degree of overlap and the dimensionality are both high. Maybe the high dimension and high overlap will increase the probability of misclassification of partition vectors determined by the maximum Mahalanobis distance. Nevertheless, on the whole, the MRIPEM method can still be regarded as a promising algorithm in terms of situations in low-dimension and low-overlap.

Statements
The real datasets that support the findings of this study are openly available in UCI machine learning database at https://archive.ics.uci.edu/ml and KEEL-dataset repository at https:// sci2s.ugr.es/keel/datasets.php (Ref [28,29]).